Abstract

Consider an agent interacting with an environment in cycles. In every interaction cycle the agent is rewarded for its performance. We compare the average reward $U$ from cycle $1$ to $m$ (average value) with the future discounted reward $V$ from cycle $k$ to $\infty$ (discounted value). We consider essentially arbitrary (non-geometric) discount sequences and arbitrary reward sequences (non-MDP environments). We show that asymptotically $U$ for $m\to\infty$ and $V$ for $k\to\infty$ are equal, provided both limits exist. Further, if the effective horizon grows linearly with $k$ or faster, then the existence of the limit of $U$ implies that the limit of $V$ exists. Conversely, if the effective horizon grows linearly with $k$ or slower, then existence of the limit of $V$ implies that the limit of $U$ exists.

1 Introduction

We consider the reinforcement learning setup [RN03, Hut05], where an agent interacts with an environment in cycles. In cycle $k$, the agent outputs (acts) $a_k$, then it makes observation $o_k$ and receives reward $r_k$, both provided by the environment. Then the next cycle $k+1$ starts. For simplicity we assume that agent and environment are deterministic.

Typically one is interested in action sequences, called plans or policies, for agents that result in high reward. The simplest reasonable measure of performance is the total reward sum or equivalently the average reward, called average value $U_{1m} := \frac{1}{m}[r_1+\dots+r_m]$, where $m$ should be the lifespan of the agent. One problem is that the lifetime is often not known in advance, e.g. often the time one is willing to let a system run depends on its displayed performance. More serious is that the measure is indifferent to whether an agent receives high rewards early or late if the values are the same.

A natural (non-arbitrary) choice for $m$ is to consider the limit $m\to\infty$. While the indifference may be acceptable for finite $m$, it can be catastrophic for $m=\infty$. Consider an agent that receives no reward until it first acts with $b$ in some cycle $k$, and then once receives reward $\frac{k-1}{k}$. For finite $m$, the optimal $k$ to switch from action $a$ to $b$ is $k_{opt}=m$. Hence $k_{opt}\to\infty$ for $m\to\infty$, so the reward maximizing agent for $m\to\infty$ actually always acts with $a$, and hence has zero reward, although a value arbitrarily close to 1 would be achievable. (Immortal agents are lazy [Hut05, Sec. 5.7]). More serious, in general the limit $U_{1\infty}$ may not even exist.

Another approach is to consider a moving horizon. In cycle $k$, the agent tries to maximize $U_{km} := \frac{1}{m-k+1}[r_k+\dots+r_m]$, where $m$ increases with $k$, e.g. $m=k+h-1$ with $h$ being the horizon. This naive truncation is often used in games like chess (plus a heuristic reward in cycle $m$) to get a reasonably small search tree. While this can work in practice, it can lead to inconsistent optimal strategies, i.e. to agents that change their mind. Consider the example above with $h=2$. In every cycle $k$ it is better first to act $a$ and then $b$ ($U_{km}\propto r_k+r_{k+1} = 0+\frac{k}{k+1}$), rather than immediately $b$ ($U_{km}\propto r_k+r_{k+1} = \frac{k-1}{k}+0$), or $a,a$ ($U_{km}\propto 0+0$). But entering the next cycle $k+1$, the agent throws its original plan overboard, to now choose $a$ in favor of $b$, followed by $b$. This pattern repeats, resulting in no reward at all.
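
The inconsistency of the moving-horizon agent can be replayed in a few lines. The following is a minimal sketch, assuming the one-off reward $\frac{k-1}{k}$ for first playing $b$ in cycle $k$ and horizon $h=2$ as in the example; the function names are illustrative and not part of the original text.

```python
from fractions import Fraction

def switch_reward(k):
    """One-off reward paid in the cycle k where the agent first plays b."""
    return Fraction(k - 1, k)

def best_first_action(k):
    """First action of the best 2-step plan (horizon h = 2) made in cycle k."""
    plans = {
        ("b", "a"): switch_reward(k),      # switch now: (k-1)/k + 0
        ("a", "b"): switch_reward(k + 1),  # switch next cycle: 0 + k/(k+1)
        ("a", "a"): Fraction(0),           # no switch within the horizon
    }
    best_plan = max(plans, key=plans.get)
    return best_plan[0]

def run_moving_horizon_agent(cycles):
    """Replan in every cycle and return the reward actually collected."""
    for k in range(1, cycles + 1):
        if best_first_action(k) == "b":
            return switch_reward(k)  # the reward is paid only once
    return Fraction(0)               # the agent postponed the switch forever

print(run_moving_horizon_agent(1000))
```

Because $\frac{k}{k+1} > \frac{k-1}{k}$ for every $k$, the plan "first $a$, then $b$" wins in every cycle, so the agent never executes the second step of any plan it makes.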

The standard solution to the above problems is to consider geometrically (= exponentially) discounted reward [Sam37, BT96, SB98]. One discounts the reward for every cycle of delay by a factor $\gamma<1$, i.e. considers $V_{k\gamma} := (1-\gamma)\sum_{i=k}^\infty \gamma^{i-k} r_i$. The $V_{1\gamma}$ maximizing policy is consistent in the sense that its actions $a_k, a_{k+1}, \dots$ coincide with the optimal policy based on $V_{k\gamma}$. At first glance, there seems to be no arbitrary lifetime $m$ or horizon $h$, but this is an illusion. $V_{k\gamma}$ is dominated by contributions from rewards $r_k \dots r_{k+O(1/\ln\gamma^{-1})}$, so has an effective horizon $h^{\text{eff}} \approx 1/\ln\gamma^{-1}$. While such a sliding effective horizon does not cause inconsistent policies, it can nevertheless lead to suboptimal behavior. For every (effective) horizon, there is a task that needs a larger horizon to be solved. For instance, while $h^{\text{eff}}=5$ is sufficient for tic-tac-toe, it is definitely insufficient for chess. There are elegant closed form solutions for Bandit problems, which show that for any $\gamma<1$, the Bayes-optimal policy can get stuck with a suboptimal arm (is not self-optimizing) [BF85, KV86].

For $\gamma\to1$, $h^{\text{eff}}\to\infty$, and the defect decreases. There are various deep papers considering the limit $\gamma\to1$ [Kel81], and comparing it to the limit $m\to\infty$ [Kak01]. The analysis is typically restricted to ergodic MDPs for which the limits $\lim_{\gamma\to1}V_{1\gamma}$ and $\lim_{m\to\infty}U_{1m}$ exist. But like the limit policy for $m\to\infty$, the limit policy for $\gamma\to1$ can display very poor performance, i.e. we need to choose $\gamma<1$ fixed in advance (but how?), or consider higher order terms [Mah96, AA92]. We also cannot consistently adapt $\gamma$ with $k$. Finally, the value limits may not exist beyond ergodic MDPs.

There is little work on discounts other than geometric ones. In the psychology and economics literature it has been argued that people discount a one day (= one cycle) delay in reward more if it concerns rewards now rather than later, e.g. in a year (plus one day) [FLO02]. So there is some work on "sliding" discount sequences with value $\propto \gamma_0 r_k+\gamma_1 r_{k+1}+\dots$. One can show that this also leads to inconsistent policies if $\gamma$ is non-geometric [Str55, VW04].

Is there any non-geometric discount leading to consistent policies? In [Hut02] the generally discounted value $V_{k\gamma} := \frac{1}{\Gamma_k}\sum_{i=k}^\infty \gamma_i r_i$ with $\Gamma_k := \sum_{i=k}^\infty \gamma_i < \infty$ has been introduced. It is well-defined for arbitrary environments, leads to consistent policies, and e.g. for quadratic discount $\gamma_k = 1/k^2$ to an increasing effective horizon (proportionally to $k$), i.e. the optimal agent becomes increasingly farsighted in a consistent way. It also leads to self-optimizing policies in ergodic ($k$th-order) MDPs in general, Bandits in particular, and even beyond MDPs. See [Hut02] for these and [Hut05] for more results. The only other serious analysis of general discounts we are aware of is in [BF85], but their analysis is limited to Bandits and so-called regular discount. This discount has bounded effective horizon, so also does not lead to self-optimizing policies.

The asymptotic total average performance $U_{1\infty}$ and future discounted performance $V_{\infty\gamma}$ are of key interest. For instance, often we do not know the exact environment in advance but have to learn it from past experience, which is the domain of reinforcement learning [SB98] and adaptive control theory [KV86]. Ideally we would like a learning agent that performs asymptotically as well as the optimal agent that knows the environment in advance.

Contents and main results. The subject of study of this paper is the relation between $U_{1\infty}$ and $V_{\infty\gamma}$ for general discount $\gamma$ and arbitrary environment. The importance of the performance measures $U$ and $V$, and general discount $\gamma$ has been discussed above. There is also a clear need to study general environments beyond ergodic MDPs, since the real world is neither ergodic (e.g. losing an arm is irreversible) nor completely observable.

The only restriction we impose on the discount sequence $\gamma$ is summability ($\Gamma_1<\infty$) so that $V_{k\gamma}$ exists, and monotonicity ($\gamma_k\ge\gamma_{k+1}$). Our main result is that if both limits $U_{1\infty}$ and $V_{\infty\gamma}$ exist, then they are necessarily equal (Section 7, Theorem 19). Somewhat surprisingly this holds for any discount sequence $\gamma$ and any environment (reward sequence $r$), whatsoever.

Note that limit $U_{1\infty}$ may exist or not, independent of whether $V_{\infty\gamma}$ exists or not. We present examples of the four possibilities in Section 2. Under certain conditions on $\gamma$, existence of $U_{1\infty}$ implies existence of $V_{\infty\gamma}$, or vice versa. We show that if (a quantity closely related to) the effective horizon grows linearly with $k$ or faster, then existence of $U_{1\infty}$ implies existence of $V_{\infty\gamma}$ and their equality (Section 5, Theorem 15). Conversely, if the effective horizon grows linearly with $k$ or slower, then existence of $V_{\infty\gamma}$ implies existence of $U_{1\infty}$ and their equality (Section 6, Theorem 17). Note that apart from discounts with oscillating effective horizons, this implies (and this is actually the path used to prove) the first mentioned main result. In Sections 3 and 4 we define and provide some basic properties of average and discounted value, respectively.

2 Example Discount and Reward Sequences 

In order to get a better feeling for general discount sequences, effective horizons, 
average and discounted value, and their relation and existence, we first consider 
various examples. 

Notation. In the following we assume that $i,k,m,n\in\mathbb{N}$ are natural numbers, $\underline{F} := \underline{\lim}_n F_n = \lim_{k\to\infty}\inf_{n>k}F_n$ denotes the limit inferior and $\overline{F} := \overline{\lim}_n F_n = \lim_{k\to\infty}\sup_{n>k}F_n$ the limit superior of $F_n$, $\forall' n$ means for all but finitely many $n$, $\gamma=(\gamma_1,\gamma_2,\dots)$ denotes a summable discount sequence in the sense that $\Gamma_k := \sum_{i=k}^\infty\gamma_i<\infty$ and $\gamma_k\in\mathbb{R}^+$ $\forall k$, $r=(r_1,r_2,\dots)$ is a bounded reward sequence, w.l.g. $r_k\in[0,1]$ $\forall k$, constants $\alpha,\beta\in[0,1]$, boundaries $0<k_1<m_1<k_2<m_2<k_3<\dots$, total average value $U_{1m} := \frac{1}{m}\sum_{i=1}^m r_i$ (see Definition 10) and future discounted value $V_{k\gamma} := \frac{1}{\Gamma_k}\sum_{i=k}^\infty\gamma_i r_i$ (see Definition 12). The derived theorems also apply to general bounded rewards $r_i\in[a,b]$ by linearly rescaling $r_i \leadsto \frac{r_i-a}{b-a}\in[0,1]$, with $U$ and $V$ rescaled accordingly.

Discount sequences and effective horizons. Rewards $r_{k+h}$ give only a small contribution to $V_{k\gamma}$ for large $h$, since $\gamma_{k+h}\to0$ for $h\to\infty$. More important, the whole reward tail from $k+h$ to $\infty$ in $V_{k\gamma}$ is bounded by $\frac{1}{\Gamma_k}[\gamma_{k+h}+\gamma_{k+h+1}+\dots]$, which tends to zero for $h\to\infty$. So effectively $V_{k\gamma}$ has a horizon $h$ for which the cumulative tail weight $\Gamma_{k+h}/\Gamma_k$ is, say, about $\frac12$, or more formally $h^{\text{eff}}_k := \min\{h\ge0 : \Gamma_{k+h}\le\frac12\Gamma_k\}$. The closely related quantity $h^{\text{quasi}}_k := \Gamma_k/\gamma_k$, which we call the quasi-horizon, will play an important role in this work. The following table summarizes various discounts with their properties.

For instance, the standard discount is geometric $\gamma_k=\gamma^k$ for some $0\le\gamma<1$, with constant effective horizon $h^{\text{eff}}_k\approx\ln 2/\ln\gamma^{-1}$. (An agent with $\gamma=0.95$ can/will not plan farther than about 10-20 cycles ahead). Since in this work we allow for general discount, we can even recover the average value $U_{1m}$ by choosing $\gamma_k=1$ for $k\le m$ and $\gamma_k=0$ for $k>m$. A power discount $\gamma_k=k^{-\alpha}$ ($\alpha>1$) is very interesting, since it leads to a linearly increasing effective horizon $h^{\text{eff}}_k\propto k$, i.e. to an agent whose farsightedness increases proportionally with age. This choice has some appeal, as it avoids preselection of a global time-scale like $m$ or $\frac{1}{1-\gamma}$, and it seems that humans of age $k$ years usually do not plan their lives for more than, perhaps, the next $k$ years. It is also the boundary case for which $U_{1\infty}$ exists if and only if $V_{\infty\gamma}$ exists.
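
The two horizon notions are easy to evaluate when the tail sum $\Gamma_k$ is known in closed form. The following is a minimal sketch, assuming the quadratic discount $\gamma_k=\frac{1}{k(k+1)}$ with $\Gamma_k=\frac1k$ and the geometric discount $\gamma_k=\gamma^k$ with $\Gamma_k=\frac{\gamma^k}{1-\gamma}$; the helper names are illustrative only.

```python
from fractions import Fraction

def effective_horizon(Gamma, k):
    """Smallest h >= 0 with Gamma(k + h) <= Gamma(k) / 2."""
    h = 0
    while Gamma(k + h) > Gamma(k) / 2:
        h += 1
    return h

def quasi_horizon(gamma, Gamma, k):
    """Quasi-horizon Gamma_k / gamma_k."""
    return Gamma(k) / gamma(k)

# Quadratic discount, exact arithmetic.
quad_gamma = lambda k: Fraction(1, k * (k + 1))
quad_Gamma = lambda k: Fraction(1, k)

# Geometric discount with factor g (floating point).
g = 0.95
geo_gamma = lambda k: g ** k
geo_Gamma = lambda k: g ** k / (1 - g)

for k in (10, 100, 1000):
    print(k,
          effective_horizon(quad_Gamma, k), quasi_horizon(quad_gamma, quad_Gamma, k),
          effective_horizon(geo_Gamma, k), round(quasi_horizon(geo_gamma, geo_Gamma, k), 6))
```

For the quadratic discount both horizons grow linearly with $k$, while for the geometric discount they stay constant, which is the distinction the later theorems rely on.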

Example reward sequences. Most of our (counter)examples will be for binary reward $r\in\{0,1\}^\infty$. We call a maximal consecutive subsequence of ones a 1-run. We denote start, end, and length of the $n$th run by $k_n$, $m_n-1$, and $A_n=m_n-k_n$, respectively. The following 0-run starts at $m_n$, ends at $k_{n+1}-1$, and has length $B_n=k_{n+1}-m_n$. The (non-normalized) discount sum in 1/0-run $n$ is denoted by $a_n$ / $b_n$, respectively. The following definition and two lemmas facilitate the discussion of our examples. The proofs contain further useful relations.

Definition 1 (Value for binary rewards) Every binary reward sequence $r\in\{0,1\}^\infty$ can be defined by the sequence of change points $0<k_1<m_1<k_2<m_2<\dots$ with

$$r_k=1 \iff k\in\bigcup_n S_n, \quad\text{where } S_n := \{k\in\mathbb{N} : k_n\le k<m_n\}.$$

The intuition behind the following lemma is that the relative length $A_n$ of a 1-run and the following 0-run $B_n$ (previous 0-run $B_{n-1}$) asymptotically provides a lower (upper) limit of the average value $U_{1m}$.

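Lemma 2 (Average value for binary rewards) For binary $r$ of Definition 1, let $A_n := m_n-k_n$ and $B_n := k_{n+1}-m_n$ be the lengths of the $n$th 1/0-run. Then

If $\frac{A_n}{A_n+B_n}\to\alpha$ then $\underline{U}_{1\infty} = \lim_n U_{1,k_n-1} = \alpha$

If $\frac{A_n}{B_{n-1}+A_n}\to\beta$ then $\overline{U}_{1\infty} = \lim_n U_{1,m_n-1} = \beta$

In particular, if $\alpha=\beta$, then $U_{1\infty}=\alpha=\beta$ exists.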

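Proof. The elementary identity $U_{1m} = U_{1,m-1}+\frac1m(r_m-U_{1,m-1})$, whose right-hand side is $\ge U_{1,m-1}$ if $r_m=1$ and $\le U_{1,m-1}$ if $r_m=0$, implies

$$U_{1,k_n-1}\le U_{1m}\le U_{1,m_n-1} \quad\text{for } k_n\le m<m_n$$
$$U_{1,k_{n+1}-1}\le U_{1m}\le U_{1,m_n-1} \quad\text{for } m_n\le m<k_{n+1}$$
$$\Rightarrow\quad \inf_{n\ge n_0}U_{1,k_n-1}\le U_{1m}\le\sup_{n\ge n_0}U_{1,m_n-1} \quad\forall m\ge k_{n_0}$$
$$\Rightarrow\quad \underline{\lim}_n U_{1,k_n-1} = \underline{U}_{1\infty}\le\overline{U}_{1\infty} = \overline{\lim}_n U_{1,m_n-1} \qquad (1)$$

Note the equalities in the last line. The inequalities not already implied by the bounds above hold, since $(U_{1,k_n-1})$ and $(U_{1,m_n-1})$ are subsequences of $(U_{1m})$. Now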

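$$\text{If } \frac{A_n}{A_n+B_n}\ge\alpha\ \forall n \text{ then } U_{1,k_n-1} = \frac{A_1+\dots+A_{n-1}}{A_1+B_1+\dots+A_{n-1}+B_{n-1}}\ge\alpha\ \forall n \qquad (2)$$

This implies $\inf_n\frac{A_n}{A_n+B_n}\le\inf_n U_{1,k_n-1}$. If the condition in (2) is initially (for a finite number of $n$) violated, the conclusion in (2) still holds asymptotically. A standard argument along these lines shows that we can replace the $\inf$ by a $\underline{\lim}$, i.e.

$$\underline{\lim}_n\frac{A_n}{A_n+B_n}\le\underline{\lim}_n U_{1,k_n-1} \quad\text{and similarly}\quad \overline{\lim}_n\frac{A_n}{A_n+B_n}\ge\overline{\lim}_n U_{1,k_n-1}$$

Together this shows that $\lim_n U_{1,k_n-1}=\alpha$ exists, if $\lim_n\frac{A_n}{A_n+B_n}=\alpha$ exists. Similarly

$$\text{If } \frac{A_n}{B_{n-1}+A_n}\ge\beta\ \forall n \text{ then } U_{1,m_n-1} = \frac{A_1+\dots+A_n}{B_0+A_1+\dots+B_{n-1}+A_n}\ge\beta\ \forall n \qquad (3)$$

where $B_0:=0$. This implies $\inf_n\frac{A_n}{B_{n-1}+A_n}\le\inf_n U_{1,m_n-1}$, and an asymptotic refinement of this argument gives

$$\underline{\lim}_n\frac{A_n}{B_{n-1}+A_n}\le\underline{\lim}_n U_{1,m_n-1} \quad\text{and similarly}\quad \overline{\lim}_n\frac{A_n}{B_{n-1}+A_n}\ge\overline{\lim}_n U_{1,m_n-1}$$

Together this shows that $\lim_n U_{1,m_n-1}=\beta$ exists, if $\lim_n\frac{A_n}{B_{n-1}+A_n}=\beta$ exists. ■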

Similarly to Lemma 2, the asymptotic ratio of the discount sum $a_n$ of a 1-run and the discount sum $b_n$ of the following ($b_{n-1}$ of the previous) 0-run determines the upper (lower) limits of the discounted value $V_{k\gamma}$.

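Lemma 3 (Discounted value for binary rewards) For binary $r$ of Definition 1, let $a_n := \sum_{i=k_n}^{m_n-1}\gamma_i = \Gamma_{k_n}-\Gamma_{m_n}$ and $b_n := \sum_{i=m_n}^{k_{n+1}-1}\gamma_i = \Gamma_{m_n}-\Gamma_{k_{n+1}}$ be the discount sums of the $n$th 1/0-run. Then

If $\frac{a_{n+1}}{b_n+a_{n+1}}\to\alpha$ then $\underline{V}_{\infty\gamma} = \lim_n V_{m_n\gamma} = \alpha$

If $\frac{a_n}{a_n+b_n}\to\beta$ then $\overline{V}_{\infty\gamma} = \lim_n V_{k_n\gamma} = \beta$

In particular, if $\alpha=\beta$, then $V_{\infty\gamma}=\alpha=\beta$ exists.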

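Proof. The proof is very similar to the proof of Lemma 2. The elementary identity $V_{k\gamma} = V_{k+1,\gamma}+\frac{\gamma_k}{\Gamma_k}(r_k-V_{k+1,\gamma})$, whose right-hand side is $\ge V_{k+1,\gamma}$ if $r_k=1$ and $\le V_{k+1,\gamma}$ if $r_k=0$, implies

$$V_{m_n\gamma}\le V_{k\gamma}\le V_{k_n\gamma} \quad\text{for } k_n\le k\le m_n$$
$$V_{m_n\gamma}\le V_{k\gamma}\le V_{k_{n+1}\gamma} \quad\text{for } m_n\le k\le k_{n+1}$$
$$\Rightarrow\quad \inf_{n\ge n_0}V_{m_n\gamma}\le V_{k\gamma}\le\sup_{n\ge n_0}V_{k_n\gamma} \quad\forall k\ge k_{n_0}$$
$$\Rightarrow\quad \underline{\lim}_n V_{m_n\gamma} = \underline{V}_{\infty\gamma}\le\overline{V}_{\infty\gamma} = \overline{\lim}_n V_{k_n\gamma} \qquad (4)$$

Note the equalities in the last line. The inequalities not already implied by the bounds above hold, since $(V_{k_n\gamma})$ and $(V_{m_n\gamma})$ are subsequences of $(V_{k\gamma})$. Now if $\frac{a_n}{a_n+b_n}\ge\beta$ $\forall n\ge n_0$ then $V_{k_n\gamma} = \frac{a_n+a_{n+1}+\dots}{a_n+b_n+a_{n+1}+b_{n+1}+\dots}\ge\beta$ $\forall n\ge n_0$. This implies

$$\underline{\lim}_n\frac{a_n}{a_n+b_n}\le\underline{\lim}_n V_{k_n\gamma} \quad\text{and similarly}\quad \overline{\lim}_n\frac{a_n}{a_n+b_n}\ge\overline{\lim}_n V_{k_n\gamma}$$

Together this shows that $\lim_n V_{k_n\gamma}=\beta$ exists, if $\lim_n\frac{a_n}{a_n+b_n}=\beta$ exists. Similarly if $\frac{a_{n+1}}{b_n+a_{n+1}}\ge\alpha$ $\forall n\ge n_0$ then $V_{m_n\gamma} = \frac{a_{n+1}+a_{n+2}+\dots}{b_n+a_{n+1}+b_{n+1}+a_{n+2}+\dots}\ge\alpha$ $\forall n\ge n_0$. This implies

$$\underline{\lim}_n\frac{a_{n+1}}{b_n+a_{n+1}}\le\underline{\lim}_n V_{m_n\gamma} \quad\text{and similarly}\quad \overline{\lim}_n\frac{a_{n+1}}{b_n+a_{n+1}}\ge\overline{\lim}_n V_{m_n\gamma}$$

Together this shows that $\lim_n V_{m_n\gamma}=\alpha$ exists, if $\lim_n\frac{a_{n+1}}{b_n+a_{n+1}}=\alpha$ exists. ■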



Example 4 ($U_{1\infty}=V_{\infty\gamma}$) Constant rewards $r_k\equiv\alpha$ is a trivial example for which $U_{1\infty}=V_{\infty\gamma}=\alpha$ exist and are equal.

A more interesting example is $r=1^1 0^2 1^3 0^4\dots$ of linearly increasing 0/1-run-length with $A_n=2n-1$ and $B_n=2n$, for which $U_{1\infty}=\frac12$ exists. For quadratic discount $\gamma_k=\frac{1}{k(k+1)}$, using $\Gamma_k=\frac1k$, $h^{\text{quasi}}_k=k+1=\Theta(k)$, $k_n=(2n-1)(n-1)+1$, $m_n=(2n-1)n+1$, $a_n=\Gamma_{k_n}-\Gamma_{m_n}=\frac{A_n}{k_n m_n}\sim\frac{1}{2n^3}$, and $b_n=\Gamma_{m_n}-\Gamma_{k_{n+1}}=\frac{B_n}{m_n k_{n+1}}\sim\frac{1}{2n^3}$, we also get $V_{\infty\gamma}=\frac12$. The values converge, since they average over increasingly many 1/0-runs, each of decreasing weight.
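
A quick numerical check of this example is possible on a finite prefix of $r$. The following is a minimal sketch, assuming the sequence is truncated after 600 runs, which can lower the computed $V_{k\gamma}$ by at most $\Gamma_{N+1}/\Gamma_k=k/(N+1)$ for prefix length $N$; the function names are illustrative.

```python
def run_rewards(n_runs):
    """Prefix of r = 1^1 0^2 1^3 0^4 ...: run j has length j, odd runs are ones."""
    rewards = []
    for length in range(1, n_runs + 1):
        rewards.extend([length % 2] * length)
    return rewards

def average_value(rewards, m):
    """U_{1m} = (r_1 + ... + r_m) / m (rewards[0] holds r_1)."""
    return sum(rewards[:m]) / m

def discounted_value_quadratic(rewards, k):
    """V_k for gamma_i = 1/(i(i+1)), Gamma_k = 1/k, truncated at the prefix end."""
    weighted = sum(r / (i * (i + 1)) for i, r in enumerate(rewards[k - 1:], start=k))
    return weighted * k  # divide by Gamma_k = 1/k

rewards = run_rewards(600)
print(average_value(rewards, len(rewards)))
print(discounted_value_quadratic(rewards, 2000))
```

Both printed values should lie close to $\frac12$; the remaining deviation of the discounted value is of the order of the inverse number of runs inside the horizon plus the truncation error.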

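Example 5 (simple $U_{1\infty}\not\Rightarrow V_{\infty\gamma}$) Let us consider a very simple example with alternating rewards $r=101010\dots$ and geometric discount $\gamma_k=\gamma^k$. It is immediate that $U_{1\infty}=\frac12$ exists, but $\underline{V}_{\infty\gamma}=V_{2k,\gamma}=\frac{\gamma}{1+\gamma}<\frac{1}{1+\gamma}=V_{2k-1,\gamma}=\overline{V}_{\infty\gamma}$.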
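
The two closed-form values can be confirmed numerically. The following is a minimal sketch, assuming $\gamma=0.9$ and a truncation of the infinite sum after 2000 terms (the neglected tail is of order $\gamma^{2000}$); the function names are illustrative.

```python
def discounted_value_geometric(reward, k, g, terms=2000):
    """V_k = (1 - g) * sum_{i >= k} g**(i - k) * r_i, truncated after `terms` terms."""
    return (1 - g) * sum(g ** j * reward(k + j) for j in range(terms))

def alternating(i):
    """r = 1, 0, 1, 0, ... with r_1 = 1."""
    return i % 2

g = 0.9
v_odd = discounted_value_geometric(alternating, 7, g)   # compare with 1 / (1 + g)
v_even = discounted_value_geometric(alternating, 8, g)  # compare with g / (1 + g)
u_avg = sum(alternating(i) for i in range(1, 1001)) / 1000
print(v_odd, v_even, u_avg)
```

The discounted value keeps jumping between the two levels for every $k$, while the average settles at $\frac12$.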

Example 6 ($U_{1\infty}\not\Rightarrow V_{\infty\gamma}$) Let us reconsider the more interesting example $r=1^1 0^2 1^3 0^4\dots$ of linearly increasing 0/1-run-length with $A_n=2n-1$ and $B_n=2n$ for which $U_{1\infty}=\frac12$ exists, as expected. On the other hand, for geometric discount $\gamma_k=\gamma^k$, using $\Gamma_k=\frac{\gamma^k}{1-\gamma}$ and $a_n=\Gamma_{k_n}-\Gamma_{m_n}=\frac{\gamma^{k_n}}{1-\gamma}[1-\gamma^{A_n}]$ and $b_n=\Gamma_{m_n}-\Gamma_{k_{n+1}}=\frac{\gamma^{m_n}}{1-\gamma}[1-\gamma^{B_n}]$, i.e. $\frac{b_n}{a_n}\sim\gamma^{A_n}\to0$ and $\frac{a_{n+1}}{b_n}\sim\gamma^{B_n}\to0$, we get $\underline{V}_{\infty\gamma}=\alpha=0<1=\beta=\overline{V}_{\infty\gamma}$. Again, this is plausible since for $k$ at the beginning of a long run, $V_{k\gamma}$ is dominated by the reward 0/1 in this run, due to the bounded effective horizon of geometric $\gamma$.

Example 7 ($V_{\infty\gamma}\not\Rightarrow U_{1\infty}$) Discounted may not imply average value on sequences of exponentially increasing run-length like $r=1^1 0^2 1^4 0^8 1^{16}\dots$ with $A_n=2^{2n-2}=k_n$ and $B_n=2^{2n-1}=m_n$ for which $\underline{U}_{1\infty}=\frac{A_n}{A_n+B_n}=\frac13<\frac23=\frac{A_n}{B_{n-1}+A_n}=\overline{U}_{1\infty}$, i.e. $U_{1\infty}$ does not exist. On the other hand, $V_{\infty\gamma}$ exists for a discount with super-linear horizon like $\gamma_k=[k\ln^2 k]^{-1}$, since an increasing number of runs contribute to $V_{k\gamma}$: $\Gamma_k\sim\frac{1}{\ln k}$, hence $\Gamma_{k_n}\sim\frac{1}{(2n-2)\ln2}$ and $\Gamma_{m_n}\sim\frac{1}{(2n-1)\ln2}$, which implies $a_n=\Gamma_{k_n}-\Gamma_{m_n}\sim[4n^2\ln2]^{-1}\sim\Gamma_{m_n}-\Gamma_{k_{n+1}}=b_n$, i.e. $V_{\infty\gamma}=\frac12$ exists.
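
The oscillation of the average value in this example can be computed exactly at the run boundaries. The following is a minimal sketch, assuming $k_n=4^{n-1}$ and $m_n=2\cdot4^{n-1}$ as stated above; the function name is illustrative.

```python
from fractions import Fraction

def average_at_boundaries(n):
    """Return (U_{1,k_n-1}, U_{1,m_n-1}) for r = 1^1 0^2 1^4 0^8 ..., n >= 2."""
    k_n = 4 ** (n - 1)   # start of the n-th 1-run, equal to its length A_n
    m_n = 2 * k_n        # start of the following 0-run
    ones_before_k = sum(4 ** (j - 1) for j in range(1, n))  # A_1 + ... + A_{n-1}
    ones_before_m = ones_before_k + k_n                      # plus A_n
    return Fraction(ones_before_k, k_n - 1), Fraction(ones_before_m, m_n - 1)

for n in (2, 5, 10):
    low, high = average_at_boundaries(n)
    print(n, low, float(high))
```

The lower value equals $\frac13$ at every boundary $k_n-1$, while the upper value approaches $\frac23$, so $U_{1m}$ has no limit.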
Example 8 (Non-monotone discount $\gamma$, $U_{1\infty}\ne V_{\infty\gamma}$) Monotonicity of $\gamma$ in Theorems 15, 17 and 19 is necessary. As a simple counter-example consider alternating rewards $r_{2k}=0$ with arbitrary $\gamma_{2k}$ and $r_{2k-1}=1$ with $\gamma_{2k-1}=0$, which implies $V_{k\gamma}\equiv0$, but $U_{1\infty}=\frac12$.

The above counter-example is rather simplistic. One may hope equivalence to hold on smoother $\gamma$ like $\frac{\gamma_{k+1}}{\gamma_k}\to1$. The following example shows that this condition alone is not sufficient. For a counter-example one needs an oscillating $\gamma$ of constant relative amplitude, but increasing wavelength, e.g. $\gamma_k=[2+\cos(\pi\sqrt{2k})]/k^2$. For the sequence $r=1^1 0^2 1^3 0^4\dots$ of Example 4 we had $U_{1\infty}=\frac12$. Using $m_n=\frac12(2n-\frac12)^2+\frac78$ and $k_{n+1}=\frac12(2n+\frac12)^2+\frac78$, and replacing the sums in the definitions of $a_n$ and $b_n$ by integrals (substituting $k=x^2/2$, the cosine contributes $\mp\frac2\pi$ over a 1-run / 0-run), we get $a_n\sim\frac{1}{n^3}[1-\frac1\pi]$ and $b_n\sim\frac{1}{n^3}[1+\frac1\pi]$, which implies that $V_{\infty\gamma}=\frac12-\frac{1}{2\pi}$ exists, but differs from $U_{1\infty}=\frac12$.

Example 9 (Oscillating horizon) It is easy to construct a discount $\gamma$ for which $\sup_k\frac{\Gamma_k}{k\gamma_k}=\infty$ and $\sup_k\frac{k\gamma_k}{\Gamma_k}=\infty$ by alternatingly patching together discounts with super- and sub-linear quasi-horizon $h^{\text{quasi}}_k$. For instance choose $\gamma_k\propto\gamma^k$ geometric until $\frac{\Gamma_k}{k\gamma_k}<\frac1n$, then $\gamma_k\propto\frac{1}{k\ln^2 k}$ harmonic until $\frac{\Gamma_k}{k\gamma_k}>n$, then repeat with $n\leadsto n+1$. The proportionality constants can be chosen to ensure monotonicity of $\gamma$. For such $\gamma$ neither Theorem 15 nor Theorem 17 is applicable, only Theorem 19.



3 Average Value 

We now take a closer look at the (total) average value $U_{1m}$ and relate it to the future average value $U_{km}$, an intermediate quantity we need later. We recall the definition of the average value:

Definition 10 (Average value, $U_{1m}$) Let $r_i\in[0,1]$ be the reward at time $i\in\mathbb{N}$. Then

$$U_{1m} := \frac1m\sum_{i=1}^m r_i \in[0,1]$$

is the average value from time 1 to $m$, and $U_{1\infty} := \lim_{m\to\infty}U_{1m}$ the average value if it exists.

We also need the average value $U_{km} := \frac{1}{m-k+1}\sum_{i=k}^m r_i$ from $k$ to $m$ and the following Lemma.

Lemma 11 (Convergence of future average value, $U_{k\infty}$) For $k_m\le m\to\infty$ and every $k$ we have

$$U_{1m}\to\alpha \iff U_{km}\to\alpha$$
$$U_{1m}\to\alpha \ \Longrightarrow\ U_{k_m m}\to\alpha \quad\text{if } \sup_m\frac{k_m-1}{m}<1$$
$$U_{1m}\to\alpha \ \Longleftarrow\ U_{k_m m}\to\alpha$$

The first equivalence states the obvious fact (and problem) that any finite initial part has no influence on the average value $U_{1\infty}$. Chunking together many $U_{k_m m}$ implies the last $\Leftarrow$. The $\Rightarrow$ only works if we average in $U_{k_m m}$ over sufficiently many rewards, which the stated condition ensures ($r=101010\dots$ and $k_m=m$ is a simple counter-example). Note that $U_{k m_k}\to\alpha$ for $m_k\ge k\to\infty$ implies $U_{1m_k}\to\alpha$, but not necessarily $U_{1m}\to\alpha$ (e.g. in Example 7, $U_{1m_k}=\frac12$ and $\frac{k-1}{m_k}\to0$ imply $U_{km_k}\to\frac12$ by (5), but $U_{1\infty}$ does not exist).

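Proof. The trivial identity $mU_{1m}=(k-1)U_{1,k-1}+(m-k+1)U_{km}$ implies $U_{km}-U_{1m}=\frac{k-1}{m-k+1}(U_{1m}-U_{1,k-1})$, which implies

$$|U_{km}-U_{1m}| \le \frac{|U_{1m}-U_{1,k-1}|}{\frac{m}{k-1}-1} \qquad (5)$$

$\Leftrightarrow$) The numerator is bounded by 1, and for fixed $k$ and $m\to\infty$ the denominator tends to $\infty$, which proves $\Leftrightarrow$.

$\Rightarrow$) We choose (small) $\varepsilon>0$, $m_\varepsilon$ large enough so that $|U_{1m}-\alpha|<\varepsilon$ $\forall m\ge m_\varepsilon$, and $m\ge\frac{m_\varepsilon}{\varepsilon}$. If $k:=k_m\le m_\varepsilon$, then (5) is bounded by $\frac{1}{1/\varepsilon-1}$. If $k:=k_m>m_\varepsilon$, then (5) is bounded by $\frac{2\varepsilon}{1/c-1}$, where $c:=\sup_m\frac{k_m-1}{m}<1$. This shows that $|U_{k_m m}-U_{1m}|=O(\varepsilon)$ for large $m$, which implies $U_{k_m m}\to\alpha$.

$\Leftarrow$) We partition the time-range $\{1\dots m\}=\bigcup_{n=1}^{L}\{k_{m_n}\dots m_n\}$, where $m_1:=m$ and $m_{n+1}:=k_{m_n}-1$. We choose (small) $\varepsilon>0$, $m_\varepsilon$ large enough so that $|U_{k_m m}-\alpha|<\varepsilon$ $\forall m\ge m_\varepsilon$, $m\ge\frac{m_\varepsilon}{\varepsilon}$, and $l$ so that $k_{m_l}\le m_\varepsilon<m_l$. Then

$$U_{1m} = \frac1m\Big[\sum_{n=1}^{l}+\sum_{n=l+1}^{L}\Big](m_n-k_{m_n}+1)\,U_{k_{m_n}m_n} \le \frac1m\sum_{n=1}^{l}(m_n-k_{m_n}+1)(\alpha+\varepsilon)+\frac{m_{l+1}}{m}$$
$$\le \frac{m_1-k_{m_l}+1}{m}(\alpha+\varepsilon)+\frac{m_\varepsilon}{m} \le (\alpha+\varepsilon)+\varepsilon$$
$$\text{similarly}\quad U_{1m} \ge \frac{m_1-k_{m_l}+1}{m}(\alpha-\varepsilon) \ge \frac{m-m_\varepsilon}{m}(\alpha-\varepsilon) \ge (1-\varepsilon)(\alpha-\varepsilon)$$

This shows that $|U_{1m}-\alpha|\le2\varepsilon$ for sufficiently large $m$, hence $U_{1m}\to\alpha$. ■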



4 Discounted Value 

We now take a closer look at the (future) discounted value $V_{k\gamma}$ for general discounts $\gamma$, and prove some useful elementary asymptotic properties of discount $\gamma_k$ and normalizer $\Gamma_k$. We recall the definition of the discounted value:

Definition 12 (Discounted value, $V_{k\gamma}$) Let $r_i\in[0,1]$ be the reward and $\gamma_i\ge0$ a discount at time $i\in\mathbb{N}$, where $\gamma$ is assumed to be summable in the sense that $0<\Gamma_k := \sum_{i=k}^\infty\gamma_i<\infty$. Then

$$V_{k\gamma} := \frac{1}{\Gamma_k}\sum_{i=k}^\infty\gamma_i r_i \in[0,1]$$

is the $\gamma$-discounted future value and $V_{\infty\gamma} := \lim_{k\to\infty}V_{k\gamma}$ its limit if it exists.

We say that $\gamma$ is monotone if $\gamma_{k+1}\le\gamma_k$ $\forall k$. Note that monotonicity and $\Gamma_k>0$ $\forall k$ implies $\gamma_k>0$ $\forall k$ and convexity of $\Gamma_k$.

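Lemma 13 (Discount properties, $\gamma/\Gamma$)

$$i)\quad \frac{\gamma_{k+1}}{\gamma_k}\to1 \iff \frac{\gamma_{k+\Delta}}{\gamma_k}\to1 \ \ \forall\Delta\in\mathbb{N}$$
$$ii)\quad \frac{\gamma_k}{\Gamma_k}\to0 \iff \frac{\Gamma_{k+1}}{\Gamma_k}\to1 \iff \frac{\Gamma_{k+\Delta}}{\Gamma_k}\to1 \ \ \forall\Delta\in\mathbb{N}$$

Furthermore, (i) implies (ii), but not necessarily the other way around (not even if $\gamma$ is monotone).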

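Proof. (i)$\Rightarrow$ $\frac{\gamma_{k+\Delta}}{\gamma_k}=\prod_{i=k}^{k+\Delta-1}\frac{\gamma_{i+1}}{\gamma_i}\to1$ for $k\to\infty$, since $\Delta$ is finite.
(i)$\Leftarrow$ Set $\Delta=1$.

(ii) The first equivalence follows from $\Gamma_k=\gamma_k+\Gamma_{k+1}$. The proof for the second equivalence is the same as for (i) with $\gamma$ replaced by $\Gamma$.
(i)$\Rightarrow$(ii) Choose $\varepsilon>0$. (i) implies $\frac{\gamma_{k+1}}{\gamma_k}\ge1-\varepsilon$ $\forall' k$, which implies

$$\Gamma_k = \sum_{i=k}^\infty\gamma_i = \gamma_k\sum_{i=k}^\infty\prod_{j=k}^{i-1}\frac{\gamma_{j+1}}{\gamma_j} \ge \gamma_k\sum_{i=k}^\infty(1-\varepsilon)^{i-k} = \gamma_k/\varepsilon$$

hence $\frac{\gamma_k}{\Gamma_k}\le\varepsilon$ $\forall' k$, which implies $\frac{\gamma_k}{\Gamma_k}\to0$.

(i)$\not\Leftarrow$(ii) Consider counter-example $\gamma_k=4^{-\lceil\log_2 k\rceil}$, i.e. $\gamma_k=4^{-n}$ for $2^{n-1}<k\le2^n$. Since $\Gamma_k\ge\sum_{i>2^n}\gamma_i=2^{-n-1}$ we have $0\le\frac{\gamma_k}{\Gamma_k}\le2^{1-n}\to0$, but $\frac{\gamma_{k+1}}{\gamma_k}=\frac14\ne1$ for $k=2^n$. ■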
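
The counter-example can be checked with exact arithmetic. The following is a minimal sketch, assuming the block structure $2^{n-1}<k\le2^n$ described in the proof; the closed form used for $\Gamma_k$ (rest of the current block plus all later blocks) is derived here for illustration.

```python
from fractions import Fraction

def gamma(k):
    """gamma_k = 4**(-ceil(log2 k)); (k - 1).bit_length() equals ceil(log2 k) for k >= 1."""
    return Fraction(1, 4 ** (k - 1).bit_length())

def Gamma(k):
    """Exact tail sum Gamma_k = sum_{i >= k} gamma_i."""
    n = (k - 1).bit_length()                   # block index: 2**(n-1) < k <= 2**n
    rest_of_block = (2 ** n - k + 1) * Fraction(1, 4 ** n)
    later_blocks = Fraction(1, 2 ** (n + 1))   # sum_{j > n} 2**(j-1) * 4**(-j)
    return rest_of_block + later_blocks

for n in (4, 8, 12):
    k = 2 ** n
    print(k, float(gamma(k) / Gamma(k)), gamma(k + 1) / gamma(k))
```

The first ratio shrinks like $2^{-n}$, so property (ii) holds, while the second ratio stays at $\frac14$ along $k=2^n$, so property (i) fails.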



5 Average Implies Discounted Value 

We now show that existence of $\lim_m U_{1m}$ can imply existence of $\lim_k V_{k\gamma}$ and their equality. The necessary and sufficient condition for this implication to hold is roughly that the effective horizon grows linearly with $k$ or faster. The auxiliary quantity $U_{km}$ is in a sense closer to $V_{k\gamma}$ than $U_{1m}$ is, since the former two both average from $k$ (approximately) to some (effective) horizon. If $\gamma$ is sufficiently smooth, we can chop the area under the graph of $V_{k\gamma}$ (as a function of $k$) "vertically" approximately into a sum of average values, which implies

Proposition 14 (Future average implies discounted value, $U_{\infty}\Rightarrow V_{\infty\gamma}$)

Assume $k\le m_k\to\infty$ and monotone $\gamma$ with $\frac{\gamma_{m_k}}{\gamma_k}\to1$. If $U_{km_k}\to\alpha$, then $V_{k\gamma}\to\alpha$.

The proof idea is as follows: Let $k_1=k$ and $k_{n+1}=m_{k_n}+1$. Then for large $k$ we get

$$V_{k\gamma} = \frac{1}{\Gamma_k}\sum_{n=1}^\infty\sum_{i=k_n}^{m_{k_n}}\gamma_i r_i \approx \frac{1}{\Gamma_k}\sum_{n=1}^\infty\gamma_{k_n}(k_{n+1}-k_n)\,U_{k_n m_{k_n}}$$
$$\approx \frac{\alpha}{\Gamma_k}\sum_{n=1}^\infty\gamma_{k_n}(k_{n+1}-k_n) \approx \frac{\alpha}{\Gamma_k}\sum_{n=1}^\infty\sum_{i=k_n}^{m_{k_n}}\gamma_i = \alpha$$

The (omitted) formal proof specifies the approximation error, which vanishes for $k\to\infty$.

Actually we are more interested in relating the (total) average value $U_{1\infty}$ to the (future) discounted value $V_{k\gamma}$. The following (first main) Theorem shows that for linearly or faster increasing quasi-horizon, we have $V_{\infty\gamma}=U_{1\infty}$, provided the latter exists.



Theorem 15 (Average implies discounted value, $U_{1\infty}\Rightarrow V_{\infty\gamma}$)

Assume $\sup_k\frac{k\gamma_k}{\Gamma_k}<\infty$ and monotone $\gamma$. If $U_{1m}\to\alpha$, then $V_{k\gamma}\to\alpha$.

For instance, quadratic, power and harmonic discounts satisfy the condition, but faster-than-power discounts like geometric do not. Note that Theorem 15 does not imply Proposition 14.

The intuition of Theorem 15 for binary reward is as follows: For $U_{1m}$ being able to converge, the length of a run must be small compared to the total length $m$ up to this run, i.e. $o(m)$. The condition in Theorem 15 ensures that the quasi-horizon $h^{\text{quasi}}_k=\Omega(k)$ increases faster than the run-lengths $o(k)$, hence $V_{k\gamma}\approx U_{k\Omega(k)}\approx U_{1m}$ (Lemma 11) asymptotically averages over many runs, hence should also exist. The formal proof "horizontally" slices $V_{k\gamma}$ into a weighted sum of average rewards $U_{1m}$. Then $U_{1m}\to\alpha$ implies $V_{k\gamma}\to\alpha$.

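Proof. We represent $V_{k\gamma}$ as a $\delta_j$-weighted mixture of $U_{1j}$'s for $j\ge k$, where $\delta_j:=\gamma_j-\gamma_{j+1}\ge0$. The condition $\infty>c\ge\frac{k\gamma_k}{\Gamma_k}=:c_k$ ensures that the excessive initial part $\propto U_{1,k-1}$ is "negligible". It is easy to show that

$$\sum_{j=i}^\infty\delta_j=\gamma_i \quad\text{and}\quad \sum_{j=k}^\infty j\,\delta_j=(k-1)\gamma_k+\Gamma_k$$

We choose some (small) $\varepsilon>0$, and $m_\varepsilon$ large enough so that $|U_{1m}-\alpha|<\varepsilon$ $\forall m\ge m_\varepsilon$. Then, for $k>m_\varepsilon$ we get

$$V_{k\gamma} = \frac{1}{\Gamma_k}\sum_{i=k}^\infty\gamma_i r_i = \frac{1}{\Gamma_k}\sum_{i=k}^\infty\sum_{j=i}^\infty\delta_j r_i = \frac{1}{\Gamma_k}\sum_{j=k}^\infty\delta_j\sum_{i=k}^{j}r_i = \frac{1}{\Gamma_k}\sum_{j=k}^\infty\delta_j\,[\,jU_{1j}-(k-1)U_{1,k-1}]$$
$$\lessgtr \frac{1}{\Gamma_k}\sum_{j=k}^\infty\delta_j\,[\,j(\alpha\pm\varepsilon)-(k-1)(\alpha\mp\varepsilon)]$$
$$= \frac{1}{\Gamma_k}[(k-1)\gamma_k+\Gamma_k](\alpha\pm\varepsilon)-\frac{1}{\Gamma_k}\gamma_k(k-1)(\alpha\mp\varepsilon)$$
$$= \alpha\pm\Big(1+\frac{2(k-1)\gamma_k}{\Gamma_k}\Big)\varepsilon \lessgtr \alpha\pm(1+2c_k)\varepsilon$$

i.e. $|V_{k\gamma}-\alpha|<(1+2c_k)\varepsilon\le(1+2c)\varepsilon$ $\forall k>m_\varepsilon$, which implies $V_{k\gamma}\to\alpha$. ■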
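
As an illustration, the two summation identities used in the proof can be verified by hand for one concrete discount. The following worked instance assumes the quadratic discount $\gamma_j=\frac{1}{j(j+1)}$ with $\Gamma_k=\frac1k$ from Example 4; it is a sanity check, not part of the general argument.

```latex
\begin{align*}
\delta_j &= \gamma_j-\gamma_{j+1}
          = \frac{1}{j(j+1)}-\frac{1}{(j+1)(j+2)}
          = \frac{2}{j(j+1)(j+2)}, \\
\sum_{j=i}^{\infty}\delta_j &= \gamma_i
          \quad\text{(telescoping sum, since } \gamma_j\to 0\text{)}, \\
\sum_{j=k}^{\infty} j\,\delta_j
         &= \sum_{j=k}^{\infty}\frac{2}{(j+1)(j+2)}
          = 2\sum_{j=k}^{\infty}\Big(\frac{1}{j+1}-\frac{1}{j+2}\Big)
          = \frac{2}{k+1}, \\
(k-1)\gamma_k+\Gamma_k
         &= \frac{k-1}{k(k+1)}+\frac{1}{k}
          = \frac{2k}{k(k+1)}
          = \frac{2}{k+1}.
\end{align*}
```

Both sides of the second identity agree, and $c_k=\frac{k\gamma_k}{\Gamma_k}=\frac{k}{k+1}<1$, so for this discount the bound of the proof reads $|V_{k\gamma}-\alpha|<3\varepsilon$.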

Theorem 15 can, for instance, be applied to Example 4. Examples 5, 6 and 8 demonstrate that the conditions in Theorem 15 cannot be dropped. The following proposition shows more strongly that the sufficient condition is actually necessary (modulo monotonicity of $\gamma$), i.e. cannot be weakened.



Proposition 16 ($U_{1\infty}\not\Rightarrow V_{\infty\gamma}$)

For every monotone $\gamma$ with $\sup_k\frac{k\gamma_k}{\Gamma_k}=\infty$, there are $r$ for which $U_{1\infty}$ exists, but not $V_{\infty\gamma}$.

The proof idea is to construct a binary $r$ such that all change points $k_n$ and $m_n$ satisfy $\Gamma_{k_n}\approx2\Gamma_{m_n}$. This ensures that $V_{k_n\gamma}$ receives a significant contribution from 1-run $n$, i.e. is large. Choosing $k_{n+1}\gg m_n$ ensures that $V_{m_n\gamma}$ is small, hence $V_{k\gamma}$ oscillates. Since the quasi-horizon $h^{\text{quasi}}_k\ne\Omega(k)$ is small, the 1-runs are short enough to keep $U_{1m}$ small so that $U_{1\infty}=0$.

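Proof. The assumption ensures that there exists a sequence $m_1,m_2,m_3,\dots$ for which

$$\frac{m_n\gamma_{m_n}}{\Gamma_{m_n}}\ge n^2. \qquad\text{We further (can) require}\quad \Gamma_{m_n}<\tfrac12\Gamma_{m_{n-1}+1} \quad (m_0:=0)$$

For each $m_n$ we choose $k_n$ such that $\Gamma_{k_n}\approx2\Gamma_{m_n}$. More precisely, since $\Gamma$ is monotone decreasing and $\Gamma_{m_n}<2\Gamma_{m_n}<\Gamma_{m_{n-1}+1}$, there exists (a unique) $k_n$ in the range $m_{n-1}<k_n<m_n$ such that $\Gamma_{k_n+1}<2\Gamma_{m_n}\le\Gamma_{k_n}$. We choose a binary reward sequence with $r_k=1$ iff $k_n\le k<m_n$ for some $n$. Monotonicity of $\gamma$ implies

$$(m_n-k_n-1)\,\gamma_{m_n} \le \Gamma_{k_n+1}-\Gamma_{m_n} < \Gamma_{m_n} \le \frac{m_n\gamma_{m_n}}{n^2}$$

and, using $m_n\ge n^2$ (which follows from $\gamma_{m_n}\le\Gamma_{m_n}$),

$$\frac{m_n-k_n}{m_n} = \frac{m_n-k_n-1}{m_n}+\frac{1}{m_n} \le \frac{1}{n^2}+\frac{1}{n^2} = \frac{2}{n^2}$$

Counting the ones up to time $m_n-1$ and splitting the runs at an arbitrary index $l\ge2$ we get

$$U_{1,m_n-1} \le \frac{m_n}{m_n-1}\Big[\frac{k_l-1}{m_n}+\sum_{n'=l}^{n}\frac{m_{n'}-k_{n'}}{m_n}\Big] \le \frac{m_n}{m_n-1}\Big[\frac{k_l}{m_n}+\sum_{n'=l}^{n}\frac{2}{n'^2}\Big] \le \frac{m_n}{m_n-1}\Big[\frac{k_l}{m_n}+\frac{2}{l-1}\Big]$$

hence by (1) we have $\overline{U}_{1\infty}=\overline{\lim}_n U_{1,m_n-1}\le\frac{2}{l-1}$ $\forall l$, hence $U_{1\infty}=0$. On the other hand

$$V_{k_n\gamma} \ge \frac{a_n}{\Gamma_{k_n}} = \frac{\Gamma_{k_n}-\Gamma_{m_n}}{\Gamma_{k_n}} \ge \frac12$$

This shows that $V_{k\gamma}$ cannot converge to an $\alpha<\frac12$. Theorem 19 and $U_{1\infty}=0$ imply that $V_{k\gamma}$ can also not converge to an $\alpha\ge\frac12$, hence $V_{\infty\gamma}$ does not exist. ■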



6 Discounted Implies Average Value 

We now turn to the converse direction that existence of $V_{\infty\gamma}$ can imply existence of $U_{1\infty}$ and their equality, which holds under a nearly converse condition on the discount: Roughly, the effective horizon has to grow linearly with $k$ or slower.

Theorem 17 (Discounted implies average value, $V_{\infty\gamma}\Rightarrow U_{1\infty}$)

Assume $\sup_k\frac{\Gamma_k}{k\gamma_k}<\infty$ and monotone $\gamma$. If $V_{k\gamma}\to\alpha$, then $U_{1m}\to\alpha$.

For instance, power or faster and geometric discounts satisfy the condition, but harmonic does not. Note that power discounts satisfy the conditions of Theorems 15 and 17, i.e. $U_{1\infty}$ exists iff $V_{\infty\gamma}$ exists in this case.

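The intuition behind Theorem 17 for binary reward is as follows: The run-length needs to be small compared to the quasi-horizon, i.e. $o(h^{\text{quasi}}_k)$, to ensure convergence of $V_{k\gamma}$. The condition in Theorem 17 ensures that the quasi-horizon $h^{\text{quasi}}_k=O(k)$ grows at most linearly, hence the run-length $o(m)$ is a small fraction of the sequence up to $m$. This ensures that $U_{1m}$ ceases to oscillate. The formal proof slices $U_{1m}$ in "curves" to a weighted mixture of discounted values $V_{k\gamma}$. Then $V_{k\gamma}\to\alpha$ implies $U_{1m}\to\alpha$.

Proof. We represent $U_{km}$ as a ($0\le b_j$-weighted) mixture of $V_{j\gamma}$ for $k\le j\le m$. The condition $c:=\sup_k\frac{\Gamma_k}{k\gamma_k}<\infty$ ensures that the redundant tail $\propto V_{m+1,\gamma}$ is "negligible". Fix $k$ large enough so that $|V_{j\gamma}-\alpha|<\varepsilon$ $\forall j\ge k$. Then

$$\sum_{j=k}^m b_j(\alpha\pm\varepsilon) \gtrless \sum_{j=k}^m b_j V_{j\gamma} = \sum_{j=k}^m\frac{b_j}{\Gamma_j}\sum_{i=j}^m\gamma_i r_i+\sum_{j=k}^m\frac{b_j}{\Gamma_j}\sum_{i=m+1}^\infty\gamma_i r_i$$
$$= \sum_{i=k}^m\Big(\sum_{j=k}^{i}\frac{b_j}{\Gamma_j}\Big)\gamma_i r_i+\Big(\sum_{j=k}^m\frac{b_j}{\Gamma_j}\Big)\Gamma_{m+1}V_{m+1,\gamma} \qquad (6)$$

In order for the first term on the r.h.s. to be a uniform mixture, we need

$$\sum_{j=k}^{i}\frac{b_j}{\Gamma_j} = \frac{1}{\gamma_i}\cdot\frac{1}{m-k+1} \qquad (k\le i\le m)$$

Setting $i=k$ and, respectively, subtracting an $i\leadsto i-1$ term we get

$$\frac{b_k}{\Gamma_k} = \frac{1}{\gamma_k}\cdot\frac{1}{m-k+1} \quad\text{and}\quad \frac{b_i}{\Gamma_i} = \Big(\frac{1}{\gamma_i}-\frac{1}{\gamma_{i-1}}\Big)\frac{1}{m-k+1}\ge0 \quad\text{for } k<i\le m \qquad (7)$$

So we can evaluate the $b$-sum in the l.h.s. of (6) to

$$\sum_{j=k}^m b_j = \frac{1}{m-k+1}\Big[\sum_{j=k+1}^m\Big(\frac{\Gamma_j}{\gamma_j}-\frac{\Gamma_j}{\gamma_{j-1}}\Big)+\frac{\Gamma_k}{\gamma_k}\Big] = \frac{1}{m-k+1}\Big[\sum_{j=k}^m\Big(\frac{\Gamma_j}{\gamma_j}-\frac{\Gamma_{j+1}}{\gamma_j}\Big)+\frac{\Gamma_{m+1}}{\gamma_m}\Big]$$
$$= 1+\frac{\Gamma_{m+1}}{\gamma_m(m-k+1)} =: 1+c_m \qquad (8)$$

where we shifted the sum index in the second equality, and used $\Gamma_j-\Gamma_{j+1}=\gamma_j$ in the third equality. Inserting (7) and (8) into (6) we get

$$(1+c_m)(\alpha\pm\varepsilon) \gtrless \frac{1}{m-k+1}\sum_{i=k}^m r_i+\frac{\Gamma_{m+1}}{\gamma_m(m-k+1)}V_{m+1,\gamma} \gtrless U_{km}+c_m(\alpha\mp\varepsilon)$$

Note that the excess $c_m$ over unity in (8) equals the coefficient of the tail contribution $V_{m+1,\gamma}$. The above bound shows that

$$|U_{km}-\alpha| \le (1+2c_m)\varepsilon \le (1+4c)\varepsilon \quad\text{for } m\ge2k$$

Hence $U_{m/2,m}\to\alpha$, which implies $U_{1m}\to\alpha$ by Lemma 11. ■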
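
The mixture weights of the proof can be checked with exact arithmetic for a concrete discount. The following is a minimal sketch, assuming the quadratic discount $\gamma_k=\frac{1}{k(k+1)}$ with $\Gamma_k=\frac1k$ and the arbitrary choice $k=50$, $m=100$; the function name `mixture_weights` is illustrative.

```python
from fractions import Fraction

def mixture_weights(gamma, Gamma, k, m):
    """Weights b_k..b_m chosen so that sum_{j<=i} b_j/Gamma_j = 1/(gamma_i (m-k+1))."""
    n = m - k + 1
    weights = [Gamma(k) / gamma(k) / n]
    for i in range(k + 1, m + 1):
        weights.append(Gamma(i) * (1 / gamma(i) - 1 / gamma(i - 1)) / n)
    return weights

gamma = lambda j: Fraction(1, j * (j + 1))
Gamma = lambda j: Fraction(1, j)

k, m = 50, 100
b = mixture_weights(gamma, Gamma, k, m)
c_m = Gamma(m + 1) / (gamma(m) * (m - k + 1))

# Partial sums of b_j / Gamma_j, which should equal 1 / (gamma_i * (m - k + 1)).
partial, uniform = Fraction(0), True
for offset, weight in enumerate(b):
    i = k + offset
    partial += weight / Gamma(i)
    uniform = uniform and partial == 1 / (gamma(i) * (m - k + 1))

print(sum(b), 1 + c_m, uniform)
```

The weights are non-negative because $\gamma$ is monotone, and their sum exceeds one by exactly the coefficient $c_m$ of the tail term.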

Theorem 17 can, for instance, be applied to Example 4. Examples 7 and 8 demonstrate that the conditions in Theorem 17 cannot be dropped. The following proposition shows more strongly that the sufficient condition is actually necessary, i.e. cannot be weakened.

Proposition 18 ($V_{\infty\gamma}\not\Rightarrow U_{1\infty}$)

For every monotone $\gamma$ with $\sup_k\frac{\Gamma_k}{k\gamma_k}=\infty$, there are $r$ for which $V_{\infty\gamma}$ exists, but not $U_{1\infty}$.



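Proof. The assumption ensures that there exists a sequence $k_1,k_2,k_3,\dots$ for which

$$\frac{k_n\gamma_{k_n}}{\Gamma_{k_n}}\le\frac{1}{n^2}. \qquad\text{We further choose}\quad k_{n+1}>8k_n$$

We choose a binary reward sequence with $r_k=1$ iff $k_n\le k<m_n:=2k_n$. Then

$$V_{k_n\gamma} = \frac{1}{\Gamma_{k_n}}\sum_{l=n}^\infty[\gamma_{k_l}+\dots+\gamma_{2k_l-1}] \le \frac{1}{\Gamma_{k_n}}\sum_{l=n}^\infty k_l\gamma_{k_l} \le \sum_{l=n}^\infty\frac{k_l\gamma_{k_l}}{\Gamma_{k_l}} \le \sum_{l=n}^\infty\frac{1}{l^2} \le \frac{1}{n-1}\to0$$

which implies $V_{\infty\gamma}=0$ by (4). In a sense the 1-runs become asymptotically very sparse. On the other hand,

$$U_{1,m_n-1} \ge \frac{1}{m_n}[r_{k_n}+\dots+r_{m_n-1}] = \frac{1}{m_n}[m_n-k_n] = \frac12 \quad\text{but}\quad U_{1,k_{n+1}-1} \le \frac{m_n-1}{k_{n+1}-1} < \frac{2k_n}{8k_n} = \frac14$$

hence $U_{1\infty}$ does not exist. ■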



7 Average Equals Discounted Value 

Theorems 15 and 17 together imply for nearly all discount types (all in our table) that $U_{1\infty}=V_{\infty\gamma}$ if $U_{1\infty}$ and $V_{\infty\gamma}$ both exist. But Example 9 shows that there are $\gamma$ for which simultaneously $\sup_k\frac{k\gamma_k}{\Gamma_k}=\infty$ and $\sup_k\frac{\Gamma_k}{k\gamma_k}=\infty$, i.e. neither Theorem 15 nor Theorem 17 applies. This happens for quasi-horizons that grow alternatingly super- and sub-linear. Luckily, it is easy to also cover this missing case, and we get the remarkable result that $U_{1\infty}$ equals $V_{\infty\gamma}$ if both exist, for any monotone discount sequence $\gamma$ and any reward sequence $r$, whatsoever.

Theorem 19 (Average equals discounted value, $U_{1\infty}=V_{\infty\gamma}$)

Assume monotone $\gamma$ and that $U_{1\infty}$ and $V_{\infty\gamma}$ exist. Then $U_{1\infty}=V_{\infty\gamma}$.



Proof. Case 1, $\sup_k\frac{\Gamma_k}{k\gamma_k}<\infty$: By assumption, there exists an $\alpha$ such that $V_{k\gamma}\to\alpha$. Theorem 17 now implies $U_{1m}\to\alpha$, hence $U_{1\infty}=V_{\infty\gamma}=\alpha$.

Case 2, $\sup_k\frac{\Gamma_k}{k\gamma_k}=\infty$: This implies that there is an infinite subsequence $k_1<k_2<k_3<\dots$ for which $\frac{\Gamma_{k_i}}{k_i\gamma_{k_i}}\to\infty$, i.e. $c_{k_i}:=\frac{k_i\gamma_{k_i}}{\Gamma_{k_i}}\le c<\infty$. By assumption, there exists an $\alpha$ such that $U_{1m}\to\alpha$. If we look at the proof of Theorem 15 we see that it still implies $|V_{k_i\gamma}-\alpha|<(1+2c_{k_i})\varepsilon\le(1+2c)\varepsilon$ on this subsequence. Hence $V_{k_i\gamma}\to\alpha$. Since we assumed existence of the limit of $V_{k\gamma}$, this shows that the limit necessarily equals $\alpha$, i.e. again $U_{1\infty}=V_{\infty\gamma}=\alpha$. ■

Considering the simplicity of the statement in Theorem 19, the proof based on the proofs of Theorems 15 and 17 is remarkably complex. A simpler proof, if it exists, probably avoids the separation of the two (discount) cases.

Example 8 shows that the monotonicity condition in Theorem 19 cannot be dropped.

8 Discussion 

We showed that asymptotically, discounted and average value are the same, provided both exist. This holds for essentially arbitrary discount sequences (interesting since geometric discount leads to agents with bounded horizon) and arbitrary reward sequences (important since reality is neither ergodic nor MDP). Further, we exhibited the key role of power discounting with linearly increasing effective horizon. First, it separates the cases where existence of $U_{1\infty}$ implies / is implied by existence of $V_{\infty\gamma}$. Second, it neither requires nor introduces any artificial time-scale; it results in an increasingly farsighted agent with horizon proportional to its own age. In particular, we advocate the use of quadratic discounting $\gamma_k=1/k^2$. All our proofs provide convergence rates, which could be extracted from them. For simplicity we only stated the asymptotic results. The main theorems can also be generalized to probabilistic environments. Monotonicity of $\gamma$ and boundedness of rewards can possibly be somewhat relaxed. A formal relation between effective horizon and the introduced quasi-horizon may be interesting.